Eating a healthy, balanced diet is essential for maintaining good health. The “ALL DIETS” dataset consists of common recipes across various diets and lists their macronutrients, or macros (carbohydrates or carbs, protein, and fat). The dataset was retrieved from Kaggle.
The data set outlines the macros for 5 widely followed diet plans. These 5 diets, described below, are used as the first categorical variable in the analysis.
1. Paleo diet: The paleo diet is a dietary approach inspired by what our hunter-gatherer ancestors are believed to have eaten during the Paleolithic era, which ended roughly 10,000 years ago. It includes whole, unprocessed foods such as meat, fish, eggs, fruits, and vegetables. It excludes dairy products, grains, legumes, refined sugars, etc.
2. Dash diet: The DASH (Dietary Approaches to Stop Hypertension) diet is a balanced diet designed to help lower blood pressure. It includes fruits, vegetables, whole grains, and low-fat dairy products, and it limits red meat, salt, and added sugars. The diet focuses on foods that are low in fat.
3. Keto diet: The ketogenic diet is a low-carbohydrate, high-fat diet that involves drastically reducing carbohydrate intake and replacing it with fat. This reduction in carbs puts the body into a metabolic state called ketosis. Recipes typically feature meat and full-fat ingredients such as cheese and cream.
4. Mediterranean diet: This diet emphasizes fruits, vegetables, whole grains, legumes, fish, and healthy fats, with moderate amounts of poultry and dairy. It has been linked to a number of health benefits, including a reduced risk of heart disease, stroke, and type 2 diabetes.
5. Vegan diet: A vegan diet excludes all animal products, including meat, poultry, fish, eggs, dairy products, and honey. It emphasizes plant-based foods such as fruits, vegetables, legumes (beans, lentils, peas), nuts and seeds, whole grains, and plant-based milk alternatives.
The data set has a column containing 19 different cuisine types, which is used as the second categorical variable in the analysis. The “world” cuisine is excluded from the analysis.
The 3 macro columns (carbs, protein, and fat, in grams per recipe) are used as the numerical variables in the analysis below.
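The exclusion of the “world” cuisine could be sketched as follows. This is a minimal base R illustration on a hypothetical three-row data frame (`toy`); the column name `Cuisine_type` comes from the data set, while the lowercase spelling of the cuisine values is an assumption.

```r
# Hypothetical stand-in for the recipe data (not the real data set).
toy <- data.frame(
  Recipe_name  = c("A", "B", "C"),
  Cuisine_type = c("american", "world", "indian")
)

# Keep every recipe whose cuisine is not "world".
kept <- subset(toy, Cuisine_type != "world")
print(kept)
```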
Eating nutritious and healthy food affects many aspects of our lives, offering both long-term and short-term benefits. With the sheer volume of information available online about different diet types, navigating the latest trends and marketing gimmicks can be overwhelming. Some of these diets promise quick and dramatic weight loss or specific health benefits, but many lack scientific evidence and can be unhealthy or unsustainable in the long term.
The goal is to analyze the distribution of macros across the diets and to identify balanced diets, that is, diets that do not eliminate any macronutrient.
# CS544 FINAL PROJECT_Negi
library(tidyverse)
library(plotly)
library(sampling)
options(scipen = 7)  # prefer fixed over scientific notation in printed output
setwd("~/Library/CloudStorage/OneDrive-Personal/!BU/CS544/Final Project/CS544Final_Negi")
# Read the raw data: three text columns, three numeric macro columns, two text columns
local.file <- read.csv("All_Diets.csv", header = TRUE,
                       colClasses = c("character", "character", "character",
                                      "double", "double", "double",
                                      "character", "character"))
All.diets <- as_tibble(local.file)
# Keep the columns used in the analysis and shorten their names
All.diets <- select(All.diets, Diet = Diet_type, Recipe_name, Cuisine_type,
                    Protein = Protein.g., Carbs = Carbs.g., Fat = Fat.g.)
PLOT 1
From the bar plot, we can infer that the largest numbers of recipes come from the Mediterranean and Dash diets, which may indicate that these are the most popular of the 5 diets.
In this section we look at the frequency distribution of the cuisine types as well as the distribution of the diets within each cuisine. There are a total of 19 cuisines and 5 diet types in the data set.
PLOT 2
From the above bar plots, we can infer that the American cuisine has by far the most data points, so this data set may be skewed towards American cuisine. This could be because the data set was collected in America, where the predominant cuisine is American. The other cuisines may not have enough data to show their diet distribution clearly. Within American cuisine the diets are spread almost equally, which suggests that food catering to each of the diet types is easily available in American cuisine (and therefore in America).
The next plot shows the percentage distribution of the diets within each cuisine. This is important because, as we saw in Plot 2, the American cuisine has a disproportionate number of samples, which made it hard to interpret the distribution of the diet types in the other cuisines.
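A minimal sketch of how such within-cuisine percentages could be computed in base R. The data frame `toy` is hypothetical, and because the report does not show the plotting code, this approach is an assumption rather than the method actually used.

```r
# Hypothetical stand-in for the Diet and Cuisine_type columns.
toy <- data.frame(
  Diet = c("dash", "keto", "vegan", "dash", "paleo", "vegan", "keto", "dash"),
  Cuisine_type = c("american", "american", "american", "american",
                   "indian", "indian", "french", "french")
)

# Raw counts: rows are cuisines, columns are diets.
counts <- table(toy$Cuisine_type, toy$Diet)

# Row-wise percentages: the diets within each cuisine sum to 100%.
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```

Normalising each row removes the effect of one cuisine having far more recipes than the others, so cuisines with few recipes can be compared on the same scale.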
PLOT 3
Plot 3 above shows the percentage distribution of the diets within each cuisine type. Here we can infer that the Kosher cuisine has no Vegan options; it contains only 2 diets, Paleo and Dash. Chinese and British cuisines have the smallest share of Mediterranean options. Indian and South American cuisines have the largest share of Vegan options. Caribbean and British recipes include many Keto options, possibly because bread and rice are not used in every recipe in these cuisines. The Dash diet appears to be the most common diet type across the recipes, perhaps because it is a balanced diet that does not eliminate any food group (per its definition). Asian cuisines (such as Chinese and Japanese) offer a large variety of Vegan options, possibly because their recipes rarely include milk; where milk is needed, it could be substituted with soy milk, which is widely used in these cuisines.
Cuisines with a large share of Dash and Vegan recipes are the Indian, Japanese and Asian cuisines; the Kosher cuisine also has a large Dash share but, as noted above, no Vegan recipes. These cuisines use a lot of rice- and wheat-based ingredients.
French, Italian and Mexican cuisines show a similar spread of the 5 diets.
The distribution of macros across recipes is shown using box plots. Each recipe's macros are converted to percentages of that recipe's total macros (protein + carbs + fat). Percentages are used instead of absolute values because portion sizes vary between recipes, so absolute values would not be comparable from one recipe to the next.
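The conversion to percentages can be sketched as below. The two recipes are hypothetical: the second is simply a double portion of the first. The column names mirror the renamed `Protein`, `Carbs` and `Fat` columns, while the `*_pct` names are introduced here for illustration only.

```r
# Hypothetical recipes (grams); recipe 2 is a double portion of recipe 1.
recipes <- data.frame(
  Recipe  = c("single portion", "double portion"),
  Protein = c(30, 60),
  Carbs   = c(10, 20),
  Fat     = c(60, 120)
)

# Total macros per recipe, then each macro as a percentage of that total.
total <- recipes$Protein + recipes$Carbs + recipes$Fat
recipes$Protein_pct <- 100 * recipes$Protein / total
recipes$Carbs_pct   <- 100 * recipes$Carbs / total
recipes$Fat_pct     <- 100 * recipes$Fat / total
print(recipes)
```

Both rows end up with the same 30/10/60 split even though their gram values differ, which is why percentages are the fairer basis for comparison. A recipe whose macros sum to zero would produce `NaN` here and would need to be filtered out first.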
PLOT 4
In the plot above, the Dash diet has the widest box for carbs, meaning the carb share varies most across Dash recipes. However, the median carb share for the Vegan diet is higher than the median carb share for the Dash diet, so a typical Vegan recipe derives a larger share of its macros from carbs than a typical Dash recipe. The protein share looks similar among the 4 diets that include meat, i.e., Keto, Mediterranean, Dash and Paleo. Protein is lowest in the Vegan diet. The Mediterranean and Paleo diets look the most balanced, with similar box plots for the 3 macros.
Here we look at the distribution of the protein, fat and carbs content across the diets. The sum of each macro column, in absolute terms (grams) rather than percentages, is used for the analysis below.
The split between protein, fat and carbs within each diet is shown with a pie chart per diet.
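The per-diet totals behind such pie charts could be computed as in the sketch below. The four-row data frame is hypothetical, and `aggregate()` is one base R option; the report itself may have used a different function.

```r
# Hypothetical recipes (grams), two per diet.
recipes <- data.frame(
  Diet    = c("keto", "keto", "vegan", "vegan"),
  Protein = c(40, 20, 10, 10),
  Carbs   = c(10, 10, 60, 40),
  Fat     = c(70, 50, 15, 15)
)

# Sum each macro column within each diet.
totals <- aggregate(cbind(Protein, Carbs, Fat) ~ Diet, data = recipes, FUN = sum)

# Convert to a matrix so each row can be normalised to percentages.
macro_mat <- as.matrix(totals[, c("Protein", "Carbs", "Fat")])
rownames(macro_mat) <- totals$Diet
share <- round(100 * prop.table(macro_mat, margin = 1), 1)
print(share)
```

Passing one row to `pie()`, for example `pie(macro_mat["keto", ])`, would draw the chart for that diet. Switching to `margin = 2` would instead give each diet's share of a macro's overall total, which is the view used in the macro comparison pie charts.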
PLOT 5a
PLOT 5b
PLOT 5c
PLOT 5d
PLOT 5e
From the above pie charts, the Paleo and Mediterranean diets look balanced, with the 3 macros in roughly equal shares (similar to the box plots in Plot 4). The Vegan and Dash diets are high in carbs, and the Vegan diet has the lowest protein share.
The next pie charts compare the 5 diet types for each macro. The sum of each macro column, in absolute terms, is used for the charts below.
PLOT 6a
PLOT 6b
PLOT 6c
From the above pie charts, the Keto diet has the highest fat content and the Vegan diet has the lowest. The Keto and Mediterranean diets, whose recipes commonly feature meat or fish, have the highest protein content.
Pie charts as well as the box plots show similar data trends.
The Central Limit Theorem states that when repeated random samples of equal size are taken from a population, the distribution of the sample means tends toward a normal distribution as the sample size increases. The term “Population” refers to the entire, original data set. A “sample” is a subset drawn from the “Population”. In the following analysis, samples of 4 different sizes (10, 20, 30 and 40) are drawn 10,000 times each, and the 10,000 sample means are analysed for each size. In the output below, “Sample Mean” and “Sample SD” are the mean and standard deviation of those 10,000 sample means, and “Theoretical SD” is the population SD divided by the square root of the sample size.
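The procedure described above can be sketched in base R as follows. The population here is a hypothetical right-skewed vector generated with `rgamma()`, not the real macro columns, and the seed and the names `population` and `results` are illustrative assumptions.

```r
set.seed(544)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed "population" standing in for a macro column.
population <- rgamma(5000, shape = 1, scale = 150)
pop_sd <- sd(population)

sizes <- c(10, 20, 30, 40)
results <- data.frame(size = sizes, mean_xbar = NA, sd_xbar = NA,
                      theoretical_sd = pop_sd / sqrt(sizes))

for (i in seq_along(sizes)) {
  # Draw 10,000 samples (without replacement) and keep each sample's mean.
  xbar <- replicate(10000, mean(sample(population, size = sizes[i])))
  results$mean_xbar[i] <- mean(xbar)
  results$sd_xbar[i]   <- sd(xbar)
}
print(round(results, 2))
```

A histogram of `xbar` for each size, for example with `hist(xbar)`, would show the bell shape emerging even though the population itself is strongly skewed.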
## The Mean of Carbs for the Population is = 152.12 | The SD of Carbs for the Population is = 185.91
## The Mean of Protein for the Population is = 83.23 | The SD of Protein for the Population is = 89.8
## The Mean of Fat for the Population is = 117.33 | The SD of Fat for the Population is = 122.1
## Samples from Carbs
## Sample Size = 10 Sample Mean = 151.92 Sample SD = 59.21 Theoretical SD = 58.79
## Sample Size = 20 Sample Mean = 152.02 Sample SD = 41.47 Theoretical SD = 41.57
## Sample Size = 30 Sample Mean = 152.43 Sample SD = 34.5 Theoretical SD = 33.94
## Sample Size = 40 Sample Mean = 152.18 Sample SD = 29.54 Theoretical SD = 29.39
PLOT 7a
## Samples from Protein
## Sample Size = 10 Sample Mean = 83.46 Sample SD = 28.34 Theoretical SD = 28.4
## Sample Size = 20 Sample Mean = 83.1 Sample SD = 19.92 Theoretical SD = 20.1
## Sample Size = 30 Sample Mean = 83.16 Sample SD = 16.31 Theoretical SD = 16.4
## Sample Size = 40 Sample Mean = 83.29 Sample SD = 13.98 Theoretical SD = 14.2
PLOT 7b
## Samples from Fat
## Sample Size = 10 Sample Mean = 117 Sample SD = 38.6 Theoretical SD = 38.6
## Sample Size = 20 Sample Mean = 117.34 Sample SD = 27.01 Theoretical SD = 27.3
## Sample Size = 30 Sample Mean = 117.71 Sample SD = 22.31 Theoretical SD = 22.3
## Sample Size = 40 Sample Mean = 117.44 Sample SD = 19.62 Theoretical SD = 19.3
PLOT 7c
The above plots show that the 10,000 sample means for all 4 sample sizes are approximately normally distributed, with the peak frequency near the population mean. The standard deviation of the sample means decreases as the sample size increases and stays close to the theoretical standard deviation (the population SD divided by the square root of the sample size).
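As a quick check of the theoretical values, the sketch below applies the standard formula for the SD of a sample mean, the population SD divided by the square root of the sample size, to the population SDs printed earlier. Those inputs are already rounded to two decimals, so the last digit of each result is approximate.

```r
# Population SDs as printed in the report output (rounded values).
pop_sd <- c(Carbs = 185.91, Protein = 89.8, Fat = 122.1)
n <- c(10, 20, 30, 40)

# Theoretical SD of the sample mean: sigma / sqrt(n), one row per macro.
theoretical <- outer(pop_sd, n, function(s, size) s / sqrt(size))
colnames(theoretical) <- paste0("n=", n)
print(round(theoretical, 2))
```

The Carbs row should reproduce the Theoretical SD values listed under “Samples from Carbs”. Because each macro has its own population SD, each macro also has its own set of theoretical values.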
Sampling is the process of selecting a subset (called a sample) from a larger group (called a population). Samples help in drawing inferences about the entire population and often save time, as they are easier to handle than the entire population. A well-drawn sample is representative of the total population. Three sampling techniques are used in this project, as explained below:
Simple Random Sampling: Each data entry has an equal chance of being chosen.
Stratified Sampling: The population is divided into groups (strata) based on a specific characteristic, and then a random sample is drawn from each group. Here, the sample sizes are proportional to the frequency of each Diet type.
Systematic Sampling: Entries are selected at regular intervals from a list of the population.
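The three techniques can be sketched in base R as below. This is an illustration on a hypothetical population of 1,000 rows with 3 diets. The report loads the `sampling` package, whose calls are not shown, so the base R functions used here are a substitute and not the author's code.

```r
set.seed(544)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 1,000 recipes with unequal diet frequencies.
pop <- data.frame(
  id   = 1:1000,
  Diet = rep(c("dash", "keto", "vegan"), times = c(500, 300, 200))
)
n <- 100

# 1. Simple random sampling without replacement.
srs_rows <- sample(nrow(pop), size = n)

# 2. Systematic sampling: random start, then every k-th row.
k <- nrow(pop) %/% n
start <- sample(k, 1)
sys_rows <- seq(from = start, by = k, length.out = n)

# 3. Stratified sampling with sizes proportional to diet frequency.
strata_sizes <- round(n * table(pop$Diet) / nrow(pop))
str_rows <- unlist(lapply(names(strata_sizes), function(d) {
  sample(which(pop$Diet == d), size = strata_sizes[[d]])
}))

# Number of sampled recipes per diet under each method.
print(table(pop$Diet[srs_rows]))
print(table(pop$Diet[sys_rows]))
print(table(pop$Diet[str_rows]))
```

Because the stratum sizes are rounded individually, their total can differ slightly from the requested sample size, which may explain a stratified total such as 1001.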
## For random samples using simple random sampling without replacement, the number of samples taken from each diet is:
## Dash: 212
## Keto: 202
## Mediterranean: 257
## Paleo: 158
## Vegan: 171
## Total number of samples taken in this method: 1000
## For random samples using the Systematic sampling method, the number of samples taken from each diet is:
## Dash: 218
## Keto: 189
## Mediterranean: 219
## Paleo: 159
## Vegan: 190
## Total number of samples taken in this method: 1000
## For random samples using the Stratified sampling method, the number of samples taken from each diet is:
## Dash: 195
## Keto: 225
## Mediterranean: 163
## Paleo: 224
## Vegan: 194
## Total number of samples taken in this method: 1001
PLOT 8a
PLOT 8b
PLOT 8c
The above box plots use the samples drawn from the original data as data points. The plots show trends similar to those of the original data (the total population) in the distribution of macros across diets. As in the box plot for the population (Plot 4), the Mediterranean and Paleo diets look the most balanced, with similar box plots for the 3 macros. The median carb share for the Vegan diet is again higher than the median carb share for the Dash diet.
The box plots (Plots 4 and 8) and the pie charts (Plots 5 and 6) both show that the Mediterranean and Paleo diets are more balanced than the other diets, with no single macro dominating. The Paleo diet, however, excludes some food groups entirely, such as dairy products, legumes, sugars and processed foods.
The Mediterranean and Paleo diets are available in nearly all the cuisines listed in the data (the Kosher cuisine, for example, has no Mediterranean recipes in Plot 3), with the largest shares found in the Eastern European and Mediterranean cuisines.